    Model compression methods for convolutional neural networks

    Deep learning has been found to be an effective solution to many problems in the field of computer vision, and convolutional neural networks have been a particularly successful model. Convolutional neural networks extract feature maps from an image, then use the feature maps to determine to which of the preset categories the image belongs. The networks can be trained on a powerful machine and then deployed onto a target device for inference. Computing inference has become feasible on mobile phones and IoT edge devices; however, these devices come with constraints such as reduced processing resources, smaller memory caches, and decreased memory bandwidth. To make inference practical on these devices, the effectiveness of various model compression methods is evaluated quantitatively. Methods are evaluated by applying them to a simple convolutional neural network for optical vehicle classification. Convolutional layers are separated into component vectors to reduce inference time on CPU, GPU, and an embedded target. Fully connected layers are pruned and retuned in combination with regularization and dropout, and the pruned layers are compressed using a sparse matrix format. All optimizations are tested on three platforms with varying capabilities. Separation of convolutional layers improves the latency of the whole model by 3.00x on a CPU platform. Using a sparse format on a pruned model with a large fully connected layer improves the latency of the whole model by 2.01x on a desktop with a GPU and by 1.82x on the embedded platform. On average, pruning the model allows a 39.1x reduction in total model size while causing a 1.67 percentage-point reduction in accuracy when dropout is used to control overfitting. This allows a vehicle classifier to fit in 180 kB of memory with a reasonable reduction in accuracy.
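    The pruning-plus-sparse-storage step described above can be pictured with a compressed sparse row (CSR) weight matrix: after pruning, only the surviving weights and their column positions are stored, so the matrix-vector product of a fully connected layer touches only the non-zeros. The sketch below is illustrative only and is not the paper's implementation; the struct and field names are hypothetical.

```rust
/// Minimal CSR (compressed sparse row) storage for a pruned fully connected
/// layer's weight matrix. The struct and field names are illustrative only.
struct CsrMatrix {
    rows: usize,
    values: Vec<f32>,        // surviving (non-pruned) weights, stored row by row
    col_indices: Vec<usize>, // column index of each surviving weight
    row_ptr: Vec<usize>,     // start of each row in `values`; length = rows + 1
}

impl CsrMatrix {
    /// Computes y = W * x; work is proportional to the number of non-zeros,
    /// which is what makes the pruned layer cheaper than its dense original.
    fn matvec(&self, x: &[f32]) -> Vec<f32> {
        let mut y = vec![0.0f32; self.rows];
        for r in 0..self.rows {
            let mut acc = 0.0f32;
            for i in self.row_ptr[r]..self.row_ptr[r + 1] {
                acc += self.values[i] * x[self.col_indices[i]];
            }
            y[r] = acc;
        }
        y
    }
}

fn main() {
    // 2x3 weight matrix with three weights left after pruning:
    //   [ 0.5  0.0  0.0 ]
    //   [ 0.0  0.2  0.7 ]
    let w = CsrMatrix {
        rows: 2,
        values: vec![0.5, 0.2, 0.7],
        col_indices: vec![0, 1, 2],
        row_ptr: vec![0, 1, 3],
    };
    println!("{:?}", w.matvec(&[1.0, 2.0, 3.0])); // prints [0.5, 2.5]
}
```

    For a layer whose weights are mostly zero after pruning, this kind of format trades index bookkeeping for a memory footprint and multiply count proportional to the number of surviving weights, which is the trade-off behind the reported size and latency figures.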

    Memory Mapped I/O Register Test Case Generator for Large Systems-on-Chip

    This paper addresses automated testing of a massive number of Memory Mapped Input/Output (MMIO) registers in a real large-scale System-on-Chip (SoC). The golden reference is an IP-XACT hardware description that includes a global memory map. The memory addresses of peripheral registers are required by software developers to access the peripherals from software. However, frequent hardware changes occur during the HW design process, these changes do not always propagate to the SW developers, and an incorrect memory map can cause unexpected behaviour and critical errors. Our goal is to ensure that the memory map corresponds exactly to the HW description. The correctness of the memory map can be verified by writing software test cases that access all MMIO registers. Writing them manually is time-consuming and error-prone, so we present a test case generator. We use a Rust-based software stack: the generator itself is written in Rust, while the generator input is in the CMSIS-SVD format, generated from IP-XACT. We have used the generator extensively in the Tampere SoC Hub Ballast and Headsail SoCs and fixed several errors before the chips were manufactured. The test generator can be used with any IP-XACT-based SoC.
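    As a rough illustration of what one generated check could look like, the fragment below verifies a single read/write register with volatile accesses and would be compiled into an on-target test binary. It is a sketch only: the function name, the reset-value and read-back checks, and their parameters are assumptions rather than the output of the paper's generator; in the described flow the address, reset value, and access type come from the CMSIS-SVD description.

```rust
use core::ptr::{read_volatile, write_volatile};

/// Hypothetical generated check for one read/write register: confirm the
/// documented reset value, then confirm that a written pattern reads back.
///
/// Safety: `addr` must be the valid, mapped address of a side-effect-free
/// read/write MMIO register on the target SoC.
unsafe fn check_rw_register(addr: *mut u32, reset_value: u32, pattern: u32) -> bool {
    if read_volatile(addr) != reset_value {
        return false; // the memory map and the hardware disagree on the reset value
    }
    write_volatile(addr, pattern);
    read_volatile(addr) == pattern // the write must be observable on read-back
}
```

    Volatile accesses are needed so the compiler neither reorders nor elides the register reads and writes, which is what makes such a check meaningful on real hardware.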

    A Resilient System Design to Boot a RISC-V MPSoC

    This paper presents a highly resilient boot process design for Ballast, a new RISC-V based multiprocessor system-on-chip (SoC). An open source RISC-V SoC was adapted as a bootstrap processor and customized to meet our requirement for guaranteed chip wake-up. We outline the characteristic challenges of implementing a large program in a read-only memory (ROM) used for booting and propose generally applicable workflows to verify the boot process for application-specific integrated circuit (ASIC) synthesis. We implemented four distinct boot modes. Two modes that load a software bootloader autonomously from an SD card are implemented for a secure digital input output (SDIO) interface and for a serial peripheral interface (SPI), respectively. Another SDIO-based mode allows direct program execution from external memory, while the last mode is based on the use of a RISC-V debug module. The boot process was verified with instruction set simulation, register transfer level simulation, gate-level simulation, and field-programmable gate array prototyping. We received the fabricated ASIC samples and were able to successfully boot the chip via all boot modes on our custom circuit board.
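    Four boot modes imply that the boot ROM must decode some mode selection input and dispatch to the matching loader. The sketch below shows only that general dispatch shape; the enum, the 2-bit pin encoding, and the fallback choice are hypothetical and do not describe the Ballast ROM.

```rust
/// Illustrative dispatch of the kind a boot ROM performs after sampling the
/// boot-mode selection pins. Encoding and fallback are assumptions.
#[derive(Debug)]
enum BootMode {
    SdioBootloader, // load a software bootloader from an SD card over SDIO
    SpiBootloader,  // load a software bootloader from an SD card over SPI
    SdioDirect,     // execute a program directly from external memory over SDIO
    DebugModule,    // let the RISC-V debug module load and start a program
}

fn decode_boot_mode(strap_pins: u8) -> BootMode {
    // Unknown encodings fall back to the debug module so the chip always
    // stays reachable, one possible way to keep wake-up guaranteed.
    match strap_pins & 0b11 {
        0b00 => BootMode::SdioBootloader,
        0b01 => BootMode::SpiBootloader,
        0b10 => BootMode::SdioDirect,
        _ => BootMode::DebugModule,
    }
}

fn main() {
    println!("selected mode: {:?}", decode_boot_mode(0b01)); // SpiBootloader
}
```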

    Transpiling Python to Rust for Optimized Performance

    Python has become the de facto programming language in machine learning and scientific computing, but high-performance implementations are challenging to create, especially for embedded systems with limited resources. We address the challenge of compiling and optimizing Python source code for a low-level target by introducing Rust as an intermediate source code step. We show that pre-existing Python implementations that depend on optimized libraries, such as NumPy, can be transpiled to Rust semi-automatically, with potential for further automation. We use two representative test cases: Black–Scholes for financial options pricing and robot trajectory optimization. The results show up to a 12× speedup and 1.5× less memory use on a PC, and the same performance but 4× less memory use on an ARM processor on a PYNQ SoC FPGA. We also present a comprehensive list of factors for the process, to show the potential for fully automated transpilation. Our findings are generally applicable and can improve the performance of many Python applications while keeping their easy programmability.
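    To make the transpilation target concrete, the sketch below prices a European call with the Black–Scholes formula in plain Rust, roughly the kind of code a NumPy-based Python implementation of the first test case could be translated into. The function names and the erf approximation (Abramowitz & Stegun 7.1.26) are choices made here, not taken from the paper.

```rust
/// Abramowitz & Stegun 7.1.26 polynomial approximation of erf
/// (absolute error below about 1.5e-7).
fn erf(x: f64) -> f64 {
    let sign = if x < 0.0 { -1.0 } else { 1.0 };
    let x = x.abs();
    let t = 1.0 / (1.0 + 0.3275911 * x);
    let poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
        - 0.284496736) * t + 0.254829592) * t;
    sign * (1.0 - poly * (-x * x).exp())
}

/// Standard normal cumulative distribution function.
fn norm_cdf(x: f64) -> f64 {
    0.5 * (1.0 + erf(x / std::f64::consts::SQRT_2))
}

/// Black-Scholes price of a European call: spot `s`, strike `k`,
/// risk-free rate `r`, volatility `sigma`, time to expiry `t` in years.
fn black_scholes_call(s: f64, k: f64, r: f64, sigma: f64, t: f64) -> f64 {
    let d1 = ((s / k).ln() + (r + 0.5 * sigma * sigma) * t) / (sigma * t.sqrt());
    let d2 = d1 - sigma * t.sqrt();
    s * norm_cdf(d1) - k * (-r * t).exp() * norm_cdf(d2)
}

fn main() {
    // At-the-money one-year call, 20% volatility, 5% risk-free rate.
    println!("{:.4}", black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0)); // ~10.4506
}
```

    Running the example prices the at-the-money one-year call at about 10.45, the standard textbook value for these parameters.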